Single Water molecule

To get clear understanding of what we am trying to to achieve, we will start off with a single water molecule.

In [2]:
# Load all packages, some may not be needed.

import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import os
import dscribe
from ase.build import molecule
from dscribe.descriptors import SOAP
from ase.build import bulk
from ase.visualize import view
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import Ridge
from ase.io import read, write
from ase import Atom, Atoms
import time
import pickle
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as utils_data
from torch.autograd import Variable
from ase.io.cube import read_cube, write_cube
from ase.visualize.plot import plot_atoms
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import explained_variance_score, mean_squared_error
import os
import py3Dmol
import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score

We will be defining functions on the jupyter notebook for this first project. For furture projects we will be using trasferring these into libraries.

Generating points in a box

In [3]:
Bohr2Ang = 0.529177
def create_box(gamma, s11_b, s21_b, s22_b, s33_b, x_ini, x_fin, y_ini, y_fin,
               z_ini, z_fin, stepX=1, stepY=1, stepZ=1):
    '''
    This function generates the grid points in a box based on the shape of the cell
    that is defined. The function in NOT generalized to ALL cell shapes. The input 
    parameters are:
    gamma: Gamma angle formed
    *All s*_b denote lattice/ cell parameters. Input in Angstroms*
    s11_b: XX
    s21_b: XY
    s22_b: YY
    s33_b: ZZ
    x_ini: Initial X (number of points) 
    x_fin: Final X
    y_ini: Initial Y
    y_fin: Final Y
    z_ini: Initial Z
    z_fin: Final Z
    stepX: Steps along X direction. Used to reduce number of grid points
    stepY: Steps along Y direction
    stepZ: Steps along Z direction
    
    TODO: Write a FULLY generalized function to read 
    
    '''

    gamma_rad = np.pi*gamma/180
    Bohr2Ang = 0.529177
    s11 = s11_b * Bohr2Ang
    s21 = s21_b * Bohr2Ang
    s22 = s22_b * Bohr2Ang
    s33 = s33_b * Bohr2Ang
    
    sinGamma = np.sin(gamma_rad)
    cosGamma = np.cos(gamma_rad)
    s_sqrt = np.sqrt(s21**2 + s22**2)  # Look at the equation

    # TODO: Is there a more general expression for this?
    box = []

    for i in np.arange(x_ini,x_fin,stepX):
        for j in np.arange(y_ini,y_fin,stepY):
            for k in np.arange(z_ini,z_fin,stepZ):
                x = i*s11 + s_sqrt*cosGamma*j     #Had to change the cos to sin
                y = s_sqrt*sinGamma*j             #Had to change the sin to cos
                z = k*s33
                box.append((x,y,z))
                
    return box

With this $\texttt{create_box}$ function created, we would need few things to get the input for creating the grid points in a box. These parameters can be obtained by using a sample cube file as the input. A cube file has all the information that we need. We will be using ASE to read a cube file and obtain parameters.

In [4]:
Cubefile = lambda x: f'electron_den_data/water/eden_{x}.cube' # sample cube file for water

# x denotes the image number (starts from 0)
dictCube_test = read_cube(Cubefile(1))           #Reads to a dict
center_test = dictCube_test['atoms']   # Atom centers from ase Atoms format

# multiply identity matrix to get cross terms, which are zero. 
cell_sample = center_test.get_cell()*np.identity(3)

We still need to obtain the number of grid points along each of the three directions. This information could not be directly retrieved by the ASE function $\texttt{read_cube}$. Hence, we will be writing a function for it.

In [10]:
def get_NoPoints_spacing_matrix(path):
    file=open(path,'r')
    lines = file.readlines()
    noPoints = (int(lines[3].split()[0]), int(lines[4].split()[0]), int(lines[5].split()[0]))
    spacing_matrix = np.array([lines[3].split()[1:],
                               lines[4].split()[1:],
                               lines[5].split()[1:]])
    return noPoints, spacing_matrix.astype(float)
In [17]:
# Here we obtained the total number pf points along x, y and z and the
# spacing matrix
noPoints, cell_spacing = get_NoPoints_spacing_matrix(Cubefile(1))
x_f, y_f, z_f = noPoints

# Since the box is cube (no cross terms, the gamma angle is 90 deg)
water_box = create_box(90, 
                      cell_spacing[0,0],
                      cell_spacing[1,0],
                      cell_spacing[1,1],
                      cell_spacing[2,2],
                      x_ini=0,
                      x_fin=x_f,
                      y_ini=0,
                      y_fin=y_f,
                      z_ini=0,
                      z_fin=z_f)

x_cell=cell_spacing[:,0]
y_cell=cell_spacing[:,1]
z_cell=cell_spacing[:,2]
diffV=np.dot(x_cell,np.cross(y_cell,z_cell))

The next task would be to see if what we generated is correct. One quick easy way to ensure this would be to compare the total number of points that was generated by my function and the total number of points there are in a sample electron density cube file.

NOTE: The $\texttt{create_cube}$ function returns a flattened 1-D vector of points instead of a 3-D matrix of these points. This is will be convinient for generate SOAP vectors, which will be discussed later.

In [6]:
# Validating the size of box that we generated 
eden=dictCube_test['data'].flatten()
print(f'Total number of points from cube file {len(eden)}')
print(f'Total number of points we generated using create_box: {len(water_box)}')
Total number of points from cube file 68921
Total number of points we generated using create_box: 68921

Thus, we can validate that the parameters used to generate the box points were correct.

Visualization: 3Dmol

We give a lot of emphasis on visulaization since there are isosurfaces that are involved. I will be using the python notebook extension for 3Dmol https://colab.research.google.com/drive/1T2zR59TXyWRcNxRgOAiqVPJWhep83NV_?usp=sharing#scrollTo=L0V8SwajGx5U. It renders atomic visuals right after the cell where it has been executed. For just atomic coordinates, we would have to retrieve nuclei positions from a cube file and then save it as an xyz file. Then for 3Dmol, we would need to read the xyz file as a formatted string, which goes in as input to the addModel method.

In [28]:
def cube2xyz(cubepath, outpath):
    '''
    Function that takes in cube files and converts then to xyz files.
    '''
    dictCube_test = read_cube(cubepath)   #Reads to a dict
    center_test = dictCube_test['atoms']     #Atom centers from ase Atoms format
    write(outpath,center_test,format='xyz')
In [8]:
water_xyz=open('electron_den_data/water/eden_1.xyz').read()
xyzview = py3Dmol.view(width=400,height=400)
xyzview.addModel(water_xyz,'xyz')
xyzview.setStyle({'stick':{}})
# xyzview.setBackgroundColor('0xeeeeee')
# xyzview.animate({'loop': 'backAndForth'})
xyzview.zoomTo()
xyzview.show()

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

The next step would be to use one of our DFT generated electron density data (cube files) for visualization. The only additional command would be to add in the volumetric data with the cube file as a string as the input.

In [9]:
water_cube=open('electron_den_data/water/eden_1.cube').read()
xyzview = py3Dmol.view(width=400,height=400)
xyzview.addVolumetricData(water_cube,'cube',{'isoval': 0.03,"color":'grey','opacity':0.75,'wireframe':True})
xyzview.addModel(water_xyz,'xyz')
xyzview.setStyle({'stick':{}})
xyzview.zoomTo()
xyzview.show()

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

Smooth Overlap of Atomic positions

Each point in space needs to be mapped to the correspoinding electron density. We do this by mapping the atomic environment for each point in space using a descriptor. A descriptor is something can enables us to encode regions of atomic geometries using some form of chemical information in that region. We use the Smooth Overlap of Atomic positions (SOAP) formalism that uses local expansions of a gaussian smeared atomic density with orthonormal functions based on spherical harmonics and radial basis functions. The DScribe package was used to generate this SOAP representation of atomic structures (https://singroup.github.io/dscribe/latest/index.html)

TODO: Add a cartoon where you show how the SOAP is used to generate images.

We will first generate a SOAP function for water. The information needed are as follows:

  1. Species [H, O]
  2. Cut off radies (rcut)
  3. Number of radial basis functions used
  4. Maximum degree of spherical harmonics
  5. If the system is periodic or not

These are just the essentials to generate a SOAP function of a system. There are other parameters in here (https://singroup.github.io/dscribe/latest/tutorials/descriptors/soap.html) that can be included to increase the quality of SOAP.

In [10]:
species = ["H", "O"]   
rcut = 5      # Cut off radius in Angs
nmax = 4      # number of radial basis functions
lmax = 4      # number of spherical harmonics

# Setting up the SOAP descriptor
soap_water = SOAP(
    species=species,
    periodic=True,
    rcut=rcut,
    nmax=nmax,
    lmax=lmax,
)

This SOAP function $\texttt{soap_water}$ can be used to get the SOAP vectors for any point in space. Conventionally, SOAP is used atom centric approaches of machine learning (ML) interatomic potetnial development, like GAP. Instead of soley using it at the center of each atom, we would generate this SOAP for each grid point in space.

The input for this SOAP function are carterisan grid points and the location of atomic centers. So, the idea is that for each geometric arrangement of water, we would generate encode the atomic environment by generating "SOAP vectors" ( $\vec{S}(\vec{r})$ ) for each grid point. This $\vec{S}$ does not have a fixed length (size). It is dependent on the number of atomic species, $\texttt{nmax}$ and $\texttt{lmax}$. The Dscribe webpage (https://singroup.github.io/dscribe/latest/tutorials/descriptors/soap.html) has a more detailed explanation about the generation of SOAP. This information can be used to estimate the total size of $\vec{S}$, which may become useful for dimensionality reduction etc.

Now to get an understanding of how $\vec{S}$ should look like around the water molecules, we will select just single point in space and then try to see what happens. I have defined a function called $\texttt{addTrackeratom}$ that takes in the atomic centers of a molecule, the point in space and then generates a new atomic coordinate file where the point is represented as a Ne atom.

In [11]:
def addTrackeratom(cubepath, trackerPos, tempTrackerpath, outFormat='xyz'):
    '''
    Function that takes in the path to a cube file, takes in the nuclei coordinates
    and a tracker atom, then outputs a temporary xyz file. The tracker atom is set 
    to He by defalut.
    '''
    dictCube_test = read_cube(cubepath)   #Reads to a dict
    center_test = dictCube_test['atoms']     #Atom centers from ase Atoms format
    center_test.append(Atom('He'))        # Add Ne atom at the end of array
    pos=center_test.get_positions()       # get atomic positions
    pos[-1] = trackerPos                # Set track atom position as Ne atom
    center_test.set_positions(pos)      # Set new atomic positions
    write(tempTrackerpath, center_test, format=outFormat) # save temp tracker 
In [12]:
# Move the tracker point around by changing the atomic coordinates of trackerPos
trackerPos = np.array([10,0,0])     

addTrackeratom(Cubefile(1), trackerPos=trackerPos, tempTrackerpath='electron_den_data/water/track_1_temp.xyz', outFormat='xyz')
water_xyz=open('electron_den_data/water/track_1_temp.xyz').read()
print(water_xyz)
xyzview = py3Dmol.view(width=400,height=400)
xyzview.addModel(water_xyz,'xyz')
xyzview.setStyle({'sphere':{}})
xyzview.zoomTo()
xyzview.show()

#Plotting the SOAP vector (\vec{S}) trackerPos
dictCube_test = read_cube(Cubefile(1))   #Reads to a dict
water = dictCube_test['atoms']
soap_vec=soap_water.create(water, positions=[trackerPos])
plt.bar(np.arange(0,len(soap_vec[0])),soap_vec[0], label=str(trackerPos))
plt.legend(frameon=False)
plt.show()
4
Lattice="8.098760339264464 0.0 0.0 0.0 8.098760339264464 0.0 0.0 0.0 8.098760339264464" Properties=species:S:1:pos:R:3 pbc="F F F"
O        4.41111275       3.70146176       3.33748892
H        4.06914732       3.20866866       2.58745536
H        5.29014277       3.31683830       3.39014681
He      10.00000000       0.00000000       0.00000000

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

You can play around with the definition of SOAP by changing $\texttt{nmax}$, $\texttt{lmax}$, $\texttt{rcut}$ etc. Changing the number of radial and spherical harmonics will change the length of the SOAP vector. The larger the vector, the more information is inscribed, but that also increases the cost of generating, storing and learning these quantities. These are quantities that do not have a lot of physical meaning. However, the SOAP vector uses distance dependent functions. This aspect can to be used to put some physical meaning to these plots. A point chosen very further away from the water molecule will have a smaller amplitude overall. This information can be used to reduce the number of points needed to train a system, since we know that when the electron density is zero then the all points $\vec{S}$ will become zero.

Building an MLP regressor using sklearn

The next task is to map this information to electron densities obtained from DFT. We used cp2k to generate a bunch of electron densities for water using temperature controlled Nose-Hoover (NVT) MD @300 K. There are cube files generated at different points of the MD simulation that will be used as training data for an MLP regressor (neural network) using sklearn. I do acknowledge that this is not deep learning and the model may be a bit too simplified, but since this is just an introductory problem, I will stick to simpler ways of generating ML models. Further problems with more complex systems will use different ways of training NNs.

The flow of building this ML framework can be thought of in this way:

  • Each point in space ($\vec{r}$) is a training sample.
  • For each $\vec{r}$ we generate $\vec{S}$ using the SOAP function for the system, which go in as input to a NN.
  • Each $\vec{S}$ is mapped to the corresponding electron density at $\vec{r}$.
  • The length of $\vec{S}$ act as features for the input data. Thus, a system with different nmax and lmax that lead to a differently sized $\vec{S}$ cannot be used for training or predictions.

Reading input data

As each cube file from this simulation contains more than 60,000 data points, we have limitations to how many cube files we can include in training due to sheer size. For this example, we shall only use 5. For each cube file, we first read the atomic positions and electron densities. We use this atomic positions to get a SOAP vector with the help of the SOAP function for water. SOAP vectors generated from consequent cube files are then vertically stacked below the previous iteration. The electron densities read from each cube file are flattended to a 1-D vector and then appended to form a very large 1-D vector.

In [13]:
samp = np.arange(1,20,4)          # Using only 5 samples

# Create an ASE atoms object for water
dictCube_test = read_cube(Cubefile(1))   #Reads to a dict
water_ase = dictCube_test['atoms']     #Atom centers from ase Atoms format

rho = []                        # List to store electron densities       
j = 0                           # A flag for the first cube file
for i in samp:
    print(f'Read file: {Cubefile(i)}')
    dictCube = read_cube(Cubefile(i))           #Reads to a dict
    center = dictCube['atoms'].get_positions()   # Atom centers from ase Atoms format
    rho_inter = dictCube['data']
    rho_i_trim=rho_inter.ravel().reshape(-1,1)                                                    #Flattening the 3D matrix of densities
    rho_i = list(map(float,rho_i_trim))                 #Just to change the change and map to a list
#     print(rho_i)
    if j == 1:
        water_ase.set_positions([*center])            # Change the atomic positions for each cube file
        soap_vec = soap_water.create(water_ase, positions=water_box)
        X_i = soap_vec
        X_train = np.vstack([X_train, X_i])
    if j == 0:
        water_ase.set_positions([*center])
        soap_vec = soap_water.create(water_ase, positions=water_box)
        X_train = soap_vec
        j = 1
    rho += rho_i # List addition
rho = np.array(rho)
Read file: electron_den_data/water/eden_1.cube
Read file: electron_den_data/water/eden_5.cube
Read file: electron_den_data/water/eden_9.cube
Read file: electron_den_data/water/eden_13.cube
Read file: electron_den_data/water/eden_17.cube

Now we have both input (X) and output (Y) data to train a machine learning model. We can try understanding how these data sets look like.

In [14]:
print(f'Shape of input training data: {X_train.shape}')
print(f'Shape of output data: {rho.shape}')
print(f'Size of X_train: {int(X_train.nbytes)*1e-6} MB')
Shape of input training data: (344605, 180)
Shape of output data: (344605,)
Size of X_train: 496.2312 MB

The 344605 is composed of 5 sets of 68921 points. And the 180 comes from the size of each SOAP vector, which act as features of the model. An important point to note is the size of this input data. For such a small system like water, using only 5 images led to a input data set of about 500 MB. This will drastically increase with larger more complex systems. Thus, it is important to take measures to reduce the size of our input data without having to compromise on the quality of the data.

Training a model

We will use a multi-layered Perceptron regressor to train a model that can predict electron densities given the atomic positions. We will use three hidden layers of depth 90, 60, 30. The solver can iterate till 300 steps. The batch size for training is 500. We will try keeping a trach of time taken for training as well.

In [15]:
start = time.time()
model_water_1 = MLPRegressor(hidden_layer_sizes=(150,150,150),
                           random_state=1,
                           max_iter=300, 
                           shuffle=True,
                           batch_size=500).fit(X_train, rho) ## Building NN
end = time.time()
print("Training: ", (end - start), " seconds.")

file_name = 'saved_model/water/model_1.sav'
pickle.dump(model_water_1, open(file_name,'wb'))
Training:  47.263784885406494  seconds.
In [16]:
plt.plot(model_water_1.loss_curve_)
plt.xlabel('Training step (skipped)')
plt.ylabel('Loss function')
Out[16]:
Text(0, 0.5, 'Loss function')

The model took just over 20 seconds to train. I was in Bangalore virtually connected to my Pittsburgh computers at the time of training this model, so maybe the time to train the model may not be completely accurate. Based on the loss function curve, it seems to have converged. Another major issue with using sklearn to train models is the difficulty in using a custom loss function. This can drastically improve the quality of the model. The default loss functino is MSE (mean squared error).

Testing the model

We will try predicting the electron density using an unseen data set and compute the % accuracy.

In [17]:
# We will try testing our model with an unseen configurtion (15).

dictCube_test = read_cube(Cubefile(15))                #Reads to a dict
center_test_water = dictCube_test['atoms']   # Atom centers from ase Atoms format
center_test = center_test_water.get_positions()
rho_test = dictCube_test['data']
rho_test = rho_test.ravel().reshape(-1,1)
start = time.time()
water_ase.set_positions([*center_test])
soap_water_test = soap_water.create(water_ase, positions=water_box)
rho_pred = model_water_1.predict(soap_water_test)
end = time.time()
print("Prediction time: ", (end - start), " seconds.")

acc = model_water_1.score(soap_water_test, rho_test)
print(f"Prediction accuracy: {acc*100:1.2f} %")
Prediction time:  0.8577415943145752  seconds.
Prediction accuracy: 95.41 %
In [18]:
plt.plot(rho_test,'k', label = r'cp2k $\rho$',alpha=0.3)
plt.plot(rho_pred ,'r', label = r'Predicted $\rho$',alpha=0.5)
plt.legend(frameon=False)
plt.xlabel('Grid point #')
plt.ylabel(r'$\rho(\vec{r})$')
plt.show()

We get an accuracy of just about 92 %, which is not too bad for a starting model. There is still a huge room for improvement with the predictions of this model. We shall try playing with the representation of SOAP to see if that improves anything.

Plotting this one dimensionally gave us some idea of how well the predicted densities look as compared to the DFT ones. Next we will try plotting the same using py3Dmol. For that, we first have to save this as a cube file and then load the cube file for plotting.

In [19]:
out=open('electron_den_data/water/pred_15.cube','w+')
rho_test_resp = rho_pred.reshape(x_f,y_f,z_f)
write_cube(out,center_test_water,rho_test_resp,origin=water_box[0])
In [20]:
water_cube_predict=open('electron_den_data/water/pred_15.cube').read()
water_cube_dft=open('electron_den_data/water/eden_15.cube').read()

cube2xyz('electron_den_data/water/eden_15.cube','electron_den_data/water/eden_15.xyz')
water_xyz=open('electron_den_data/water/eden_15.xyz').read()
xyzview = py3Dmol.view(width=400,height=400)
xyzview.addVolumetricData(water_cube_predict,'cube',{'isoval': 0.03,"color":'red','opacity':0.75,'wireframe':True})
xyzview.addVolumetricData(water_cube_dft,'cube',{'isoval': 0.03,"color":'grey','opacity':0.75,'wireframe':True})
xyzview.addModel(water_xyz,'xyz')
xyzview.setStyle({'stick':{}})
xyzview.zoomTo()
xyzview.show()

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

From the above plot, the red is the prediction and the grey is DFT. We can't really see a major difference in predictions. Another way to compare these results would be to plot the absolute error in predicting these densities.

In [21]:
# Plotting the absolute error in prediction 
error = np.abs(rho_test.reshape(1,-1)[0] - rho_pred)
out=open('electron_den_data/water/abs_error_15.cube','w+')
error_resp = error.reshape(x_f,y_f,z_f)
write_cube(out,center_test_water,error_resp,origin=water_box[0])
In [22]:
water_cube_error=open('electron_den_data/water/abs_error_15.cube').read()
xyzview = py3Dmol.view(width=400,height=400)
xyzview.addVolumetricData(water_cube_error,'cube',{'isoval': 0.03,"color":'blue','opacity':0.75,'wireframe':True})
xyzview.addModel(water_xyz,'xyz')
xyzview.setStyle({'stick':{}})
xyzview.zoomTo()
xyzview.show()

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

For an isovalue of 0.03 we get a pretty large error surface. We can also try plotting the same one dimensionally.

In [23]:
plt.plot(error,label='Error')
plt.xlabel('Grid point #')
plt.ylabel(r'|$\rho_{pred} - \rho_{DFT}$|')
plt.legend(frameon=False)
plt.show()

This shows a significant issue in the ability of the model to predict densities accuractely.

In [24]:
plt.plot(rho_test,'k', label = r'cp2k $\rho$',alpha=0.3)
plt.plot(rho_pred ,'r', label = r'Predicted $\rho$',alpha=0.5)
plt.legend(frameon=False)
plt.xlabel('Grid point #')
plt.ylabel(r'$\rho(\vec{r})$')
plt.ylim(-0.01,0.01)
plt.title('Zoomed in')
plt.show()

A negative prediction in densities is a major concern. We would need some reforming in the SOAP vector generation to change this.

Modifications to the SOAP function

There are few thing that the previous definition of SOAP suffered from:

  1. The overall shape of the density didn't similar to the DFT profile.
  2. The isosurface of the absolute errors in predictions showed clear differences.
  3. Zoomed in plot of the 1-D densities showed large negative electron densities, which is unphysical.

These errors could be because of the representation of SOAP. Our previous definition made use of a larger cutoff radius, 5 Angs. A large cutoff leads to an environment that is less local. With a larger cutoff radius, all SOAP vector tend to have lesser differences between each other. There is no definite way to decide on an accurate cutoff radius, but we can get an idea by knowing the average bond distance of the system.

Another critical aspect for SOAP is the weighting. BY default the SOAP formalism weights the atomic densities equally irrespective of the distance of from the position of interest. Apparently, not weighting your SOAP may lead to farther away atoms dominating the spectrum. By weighting we give more importance to closer-by atomic interactions. More information about the benifits of weighting can be found in the the Dscribe page for SOAP.

We used the polynomial weighting function $$ w(r) = \begin{array}{ll} c(1 + 2 (\frac{r}{r_0})^{3} -3 (\frac{r}{r_0})^{2}))^{m}, \ \text{for}\ r \leq r_0\\ 0, \ \text{for}\ r > r_0 \end{array} $$

In [31]:
species = ["H", "O"]   
rcut = 2.0      # Cut off radius in Angs
nmax = 4      # number of radial basis functions
lmax = 4      # number of spherical harmonics

# Setting up the SOAP descriptor
soap_water = SOAP(
    species=species,
    periodic=False,
    rcut=rcut,
    nmax=nmax,
    lmax=lmax,
    weighting={"function":"poly","r0":1.5,"c":2,"m":2}
)
In [21]:
# Generating training dataset
samp = np.arange(1,20,4)          # Using only 5 samples

# Create an ASE atoms object for water
dictCube_test = read_cube(Cubefile(1))   #Reads to a dict
water_ase = dictCube_test['atoms']     #Atom centers from ase Atoms format

rho = []                        # List to store electron densities       
j = 0                           # A flag for the first cube file
for i in samp:
    print(f'Read file: {Cubefile(i)}')
    dictCube = read_cube(Cubefile(i))           #Reads to a dict
    center = dictCube['atoms'].get_positions()   # Atom centers from ase Atoms format
    rho_inter = dictCube['data']
    rho_i_trim=rho_inter.ravel().reshape(-1,1)                                                    #Flattening the 3D matrix of densities
    rho_i = list(map(float,rho_i_trim))                 #Just to change the change and map to a list
#     print(rho_i)
    if j == 1:
        water_ase.set_positions([*center])            # Change the atomic positions for each cube file
        soap_vec = soap_water.create(water_ase, positions=water_box)
        X_i = soap_vec
        X_train = np.vstack([X_train, X_i])
    if j == 0:
        water_ase.set_positions([*center])
        soap_vec = soap_water.create(water_ase, positions=water_box)
        X_train = soap_vec
        j = 1
    rho += rho_i # List addition
rho = np.array(rho)
Read file: electron_den_data/water/eden_1.cube
Read file: electron_den_data/water/eden_5.cube
Read file: electron_den_data/water/eden_9.cube
Read file: electron_den_data/water/eden_13.cube
Read file: electron_den_data/water/eden_17.cube
In [22]:
# training a model
start = time.time()
model_water_2 = MLPRegressor(hidden_layer_sizes=(100,100,100),
                           random_state=1,
                           max_iter=300, 
                           shuffle=True,
                           batch_size=500).fit(X_train, rho) ## Building NN
end = time.time()
print("Training: ", (end - start), " seconds.")

file_name = 'saved_model/water/model_2.sav'
pickle.dump(model_water_2, open(file_name,'wb'))
Training:  26.642443895339966  seconds.
In [23]:
# We will try testing our model with an unseen configurtion (15).

dictCube_test = read_cube(Cubefile(15))                #Reads to a dict
center_test_water = dictCube_test['atoms']   # Atom centers from ase Atoms format
center_test = center_test_water.get_positions()
rho_test = dictCube_test['data']
rho_test = rho_test.ravel().reshape(-1,1)
start = time.time()
water_ase.set_positions([*center_test])
soap_water_test = soap_water.create(water_ase, positions=water_box)
rho_pred = model_water_2.predict(soap_water_test)
end = time.time()
print("Prediction time: ", (end - start), " seconds.")

acc = model_water_2.score(soap_water_test, rho_test)
print(f"Prediction accuracy: {acc*100:1.2f} %")
Prediction time:  0.754737138748169  seconds.
Prediction accuracy: 98.39 %

Our new model now has an accuracy of about 99%. This is a hugh improvement.

In [24]:
out=open('electron_den_data/water/pred_15.cube','w+')
rho_test_resp = rho_pred.reshape(x_f,y_f,z_f)
write_cube(out,center_test_water,rho_test_resp,origin=water_box[0])
In [30]:
water_cube_predict=open('electron_den_data/water/pred_15.cube').read()
water_cube_dft=open('electron_den_data/water/eden_15.cube').read()

cube2xyz('electron_den_data/water/eden_15.cube','electron_den_data/water/eden_15.xyz')
water_xyz=open('electron_den_data/water/eden_15.xyz').read()
xyzview = py3Dmol.view(width=400,height=400)
xyzview.addVolumetricData(water_cube_predict,'cube',{'isoval': 0.03,"color":'red','opacity':0.75,'wireframe':True})
xyzview.addVolumetricData(water_cube_dft,'cube',{'isoval': 0.03,"color":'grey','opacity':0.75,'wireframe':True})
xyzview.addModel(water_xyz,'xyz')
xyzview.setStyle({'stick':{}})
xyzview.zoomTo()
xyzview.show()
print("DeepCDP: Red \t Cp2k: Grey")

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

DeepCDP: Red 	 Cp2k: Grey
In [25]:
# Calculate the total number of electron predicted by cp2k and DeepCDP
dictCube_test = read_cube(Cubefile(15))
cp2k_den = dictCube_test['data']
dictCube_cdp = read_cube('electron_den_data/water/pred_15.cube')
cdp_den = dictCube_cdp['data']
In [26]:
print(f"Cp2K: Total number of electrons: {np.sum(cp2k_den)*diffV:1.2f}")
print(f"DeepCDP: Total number of electrons: {np.sum(cdp_den)*diffV:1.2f}")
Cp2K: Total number of electrons: 8.00
DeepCDP: Total number of electrons: 8.59

The model overestimates the total number of electrons in the system by more than 1/2 of an electron. Probably something that needs to be fixed with proper training or adding summation constraints.

In [26]:
# Plotting the absolute error in prediction 
error = np.abs(rho_test.reshape(1,-1)[0] - rho_pred)
out=open('electron_den_data/water/abs_error_15.cube','w+')
error_resp = error.reshape(x_f,y_f,z_f)
write_cube(out,center_test_water,error_resp,origin=water_box[0])
In [27]:
water_cube_error=open('electron_den_data/water/abs_error_15.cube').read()
xyzview = py3Dmol.view(width=400,height=400)
xyzview.addVolumetricData(water_cube_error,'cube',{'isoval': 0.03,"color":'blue','opacity':0.75,'wireframe':True})
xyzview.addModel(water_xyz,'xyz')
xyzview.setStyle({'stick':{}})
xyzview.zoomTo()
xyzview.show()

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

The isosurface plots of absolute errors looks even better, with slight deviations only near the center of the O atom.

In [28]:
plt.plot(error,label='Error')
plt.xlabel('Grid point #')
plt.ylabel(r'|$\rho_{pred} - \rho_{DFT}$|')
plt.legend(frameon=False)
plt.show()
In [29]:
plt.plot(rho_test,'k', label = r'cp2k $\rho$',alpha=0.3)
plt.plot(rho_pred ,'r', label = r'Predicted $\rho$',alpha=0.5)
plt.legend(frameon=False)
plt.xlabel('Grid point #')
plt.ylabel(r'$\rho(\vec{r})$')
plt.ylim(-0.01,0.01)
plt.title('Zoomed in')
plt.show()

Another major improvement with the quality is the removal of negative densities. There is still a slight positive bias that can be potentially removed by further training.

With these improvements, we can potentially use this for predictions. Just being able to predict the system that DFT generates might not be enough. There are two things to test the extrapolative nature of our model:

  1. How well it scales for systems with more than one water molecules?
  2. Is it capable of predicting systems that aren't water, but contain H and O in them?

Scaling capabilites of our model

I will demonstrate how this model can be used for larger system sizes. It is essential to keep in mind that with larger system sizes we get more number of data points and thus more number of atoms. Since the construction of SOAP is dependent on the total number of atomic species than the number of atoms, it becomes transferrable.

For this example, we will use a system with three water molecules inside a a 10x10x10 cell

In [32]:
water_xyz=open('electron_den_data/water/water_large.xyz').read()
# print(water_xyz)
xyzview = py3Dmol.view(width=400,height=400)
xyzview.addModel(water_xyz,'xyz')
xyzview.setStyle({'stick':{}})
xyzview.zoomTo()
xyzview.show()

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

We will set the size of the cell and also make the system periodic.

In [33]:
water_large_ase = read('electron_den_data/water/water_large.xyz')
water_large_ase.set_cell([10,10,10])
water_large_ase.set_pbc([True, True, True])

Since we do not have a reference dataset to compare our predictions with, we can define the size and resolution of the cell as we like.

In [34]:
water_box_large = create_box(90, 
                      cell_spacing[0,0],
                      cell_spacing[1,0],
                      cell_spacing[1,1],
                      cell_spacing[2,2],
                      x_ini=0,
                      x_fin=50,
                      y_ini=0,
                      y_fin=50,
                      z_ini=0,
                      z_fin=50)

We will load the latest model for water using pickle load.

In [35]:
with open('saved_model/water/model_2.sav', 'rb') as file:
    model_water = pickle.load(file)
In [36]:
# Generating the SOAP vectors using the old SOAP function
soap_large=soap_water.create(water_large_ase, positions=water_box_large)
In [37]:
soap_large.shape

# Has the same shape as what was needed. 
Out[37]:
(125000, 180)
In [38]:
# Prediction
rho_pred_large=model_water.predict(soap_large)
In [39]:
out=open('electron_den_data/water/water_large.cube','w+')
rho_test_resp = rho_pred_large.reshape(50,50,50)
write_cube(out,water_large_ase,rho_test_resp,origin=water_box_large[0])
In [40]:
water_xyz=open('electron_den_data/water/water_large.xyz').read()
water_cube_error=open('electron_den_data/water/water_large.cube').read()
xyzview = py3Dmol.view(width=400,height=400)
xyzview.addVolumetricData(water_cube_error,'cube',{'isoval': 0.03,"color":'blue','opacity':0.75,'wireframe':True})
xyzview.addModel(water_xyz,'xyz')
xyzview.setStyle({'stick':{}})
xyzview.zoomTo()
xyzview.show()

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

In [41]:
dictCube_cdp = read_cube('electron_den_data/water/water_large.cube')
cdp_den = dictCube_cdp['data']
print(f"DeepCDP: Total number of electrons: {np.sum(cdp_den)*diffV:1.2f}")
DeepCDP: Total number of electrons: 24.33

This looks physically acceptable. Since there isn't a reference to compare our predictions, we would have to develop a metric to validate these results.

We found that the total number of electrons in the system came up to be 24.33. There's an extra 1/3rd of an electron added to the system. This is surprisingly good because the same model

There are few important things to note from this:

  1. Our model is able to predict for larger system sizes with periodic boundaries
  2. The model is capable of understanding overlap between orbitals, though we do not know if these results are completely true.
  3. Our model also shows smoothing in the densities.

We still haven't tested our limits with the size of the system we could perform predictions. For the next section, we will use a system with 100 water molecules in a large cell. We extracted an MD snapshot from http://www.ergoscf.org/xyz/h2o.php performed by Daniel Spångberg at Uppsala University, Department of Materials Chemistry.

In [47]:
water_xyz=open('electron_den_data/water/ergoscf_009.xyz').read()
# print(water_xyz)
xyzview = py3Dmol.view(width=400,height=400)
xyzview.addModel(water_xyz,'xyz')
xyzview.setStyle({'stick':{}})
xyzview.zoomTo()
xyzview.show()

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

We will set the size of the box to based on the number of grid points along each axis and the cell spacing. By setting the number of points to 100, we get cell parameter of 19.75 Angs.

In [48]:
# Loading the system and then setting the cell shape and pbc
a=100*Bohr2Ang*cell_spacing[0,0]
print(f'Size of cell parameter: {a:1.2f} Angs')
water_large_ase = read('electron_den_data/water/ergoscf_009.xyz')
water_large_ase.set_cell([a,a,a])
water_large_ase.set_pbc([True, True, True])
Size of cell parameter: 19.75 Angs

For the first task, we will try to generate the electron density for the entire box. Keep in mind that now we are dealing with a lot of points.

In [37]:
# Creating a box for the entire cell. With the 100 grid points across each of the three dimensions we approximately reach
# the 
water_box_whole = create_box(90, 
                      cell_spacing[0,0],
                      cell_spacing[1,0],
                      cell_spacing[1,1],
                      cell_spacing[2,2],
                      x_ini=0,
                      x_fin=100,
                      y_ini=0,
                      y_fin=100,
                      z_ini=0,
                      z_fin=100)
In [38]:
# Generating the SOAP vectors using the old SOAP function. Takes a lot of time for the entire system
start = time.time()
soap_large=soap_water.create(water_large_ase, positions=water_box_whole)
end = time.time()
soap_time = end - start
print(f'SOAP generation took: {soap_time:1.2f} seconds')
SOAP generation took: 15.17 seconds
In [39]:
# Prediction
start = time.time()
rho_pred_large=model_water.predict(soap_large)
end = time.time()
soap_time = end - start
print(f'Prediction took: {soap_time:1.2f} seconds')
Prediction took: 3.75 seconds

This is a huge load on the SOAP function to process so many points. And larger system sizes will suffer even more due to the number of grid points. There are several ways to take the issues of memory:

  1. To reduce the resolution of box; i.e., skip some points in between to reduce the load on SOAP generation
  2. Area specific generation of SOAP; e.g. 2-D systems that are periodic have huge vacuum regions on top and bottom where the electron densities are 0. We can narrow the prediction area to only where the atoms are to minimize load.
  3. Break the system to N number of sub-problems based on regions, where the densities of each region are predicted independently. This is possible because any point in space only needs the atomic coordinates to generate the SOAP vector and not other grid points in space. With sufficient parallization this prediction can be done in short time. If there are K points and K processors where each point takes 2 second to predict electron densities, then the entire cell would just take 2 seconds on K processors.

We will be using the 3rd method for the large water cluster system. The system will be broken down to 8 cubic octants, and the densities will be predicted for each of these 8 octants independently. We will not demonstrate parallelization, but try setting up a parallel problem and solve it serially. As the results from one octant are not needed for the predictions of the next octant, we will delete the intermediate results to save temporary memory.

In [40]:
# Deleting the variables that were saved to increase temp space.
del water_box_whole, soap_large
In [49]:
# A function that replaces 
def change_spacing_cube(cubefile,cell_spacing):
    '''
    The write cube function by ASE automatically sets the spacing
    between grid points based on the input. For smaller sections of
    the system, we would need to smaller cell spacing that what ASE 
    assigns. This function rewrites the cube file for the spacing 
    that we specify. 
    '''
    file=open(cubefile,'r')
    file_read = file.readlines()
    x_line=file_read[3].split()
    y_line=file_read[4].split()
    z_line=file_read[5].split()
    x_line[1:]=cell_spacing[0,:].astype(str)
    y_line[1:]=cell_spacing[1,:].astype(str)
    z_line[1:]=cell_spacing[2,:].astype(str)
    file_read[3] = "    ".join(x_line) + "\n"
    file_read[4] = "    ".join(y_line) + "\n"
    file_read[5] = "    ".join(z_line) + "\n"
    fout=open(cubefile,'w')
    for elements in file_read:
        fout.write(elements)
    fout.close()
In [52]:
# Octant 1
def predict_soap_write(x_i, x_f, y_i, y_f, z_i, z_f, outpath,
                      ase_file_atomic,soap_funct,model,cell_spacing=None):
    '''
    This function does the following:
    1. Creates a box with points based on the cell spacing and initial 
    and final x, y and z points.
    2. Generates a SOAP vector using the points created above.
    3. Performs prediction for the set of points.
    4. Writes the prediction to outpath as a cube file.
    5. If specified, changes the cell spacing based on user input
    '''
    box_oct = create_box(90, 
                          cell_spacing[0,0],
                          cell_spacing[1,0],
                          cell_spacing[1,1],
                          cell_spacing[2,2],
                          x_ini=x_i,
                          x_fin=x_f,
                          y_ini=y_i,
                          y_fin=y_f,
                          z_ini=z_i,
                          z_fin=z_f)
    start = time.time()
    soap_oct=soap_funct.create(ase_file_atomic, positions=box_oct)
    end = time.time()
    soap_time = end - start
    print(f'SOAP generation took: {soap_time:1.2f} seconds')
    
    start = time.time()
    rho_pred_oct=model.predict(soap_oct)
    end = time.time()
    pred_time = end - start
    print(f'Prediction took: {pred_time:1.2f} seconds')
    
    out=open(outpath,'w+')
    rho_test_resp = rho_pred_oct.reshape(x_f-x_i, y_f-y_i ,z_f-z_i)
    write_cube(out,ase_file_atomic,rho_test_resp,origin=box_oct[0])
    
    if cell_spacing is not None:
        change_spacing_cube(outpath,cell_spacing)
        
    print(f"Generated {outpath}")
    
    del box_oct, soap_oct, rho_pred_oct, rho_test_resp
    
    
In [53]:
#Octant 1
predict_soap_write(0,50,0,50,0,50,  # 000
                  'electron_den_data/water/ergoscf_009_oc1.cube',
                 water_large_ase, soap_water,model_water,cell_spacing)
#Octant 2
predict_soap_write(50,100,0,50,0,50,  #100
                  'electron_den_data/water/ergoscf_009_oc2.cube',
                 water_large_ase, soap_water,model_water,cell_spacing)
SOAP generation took: 1.70 seconds
Prediction took: 0.47 seconds
Generated electron_den_data/water/ergoscf_009_oc1.cube
SOAP generation took: 1.87 seconds
Prediction took: 0.47 seconds
Generated electron_den_data/water/ergoscf_009_oc2.cube
In [54]:
#Octant 3
predict_soap_write(0,50,50,100,0,50,  #010
                  'electron_den_data/water/ergoscf_009_oc3.cube',
                 water_large_ase, soap_water,model_water,cell_spacing)

#Octant 4
predict_soap_write(50,100,50,100,0,50, #110
                  'electron_den_data/water/ergoscf_009_oc4.cube',
                 water_large_ase, soap_water,model_water,cell_spacing)
SOAP generation took: 1.89 seconds
Prediction took: 0.47 seconds
Generated electron_den_data/water/ergoscf_009_oc3.cube
SOAP generation took: 1.84 seconds
Prediction took: 0.47 seconds
Generated electron_den_data/water/ergoscf_009_oc4.cube
In [55]:
#Octant 5
predict_soap_write(0,50,0,50,50,100,  #001
                  'electron_den_data/water/ergoscf_009_oc5.cube',
                 water_large_ase, soap_water,model_water,cell_spacing)

#Octant 6
predict_soap_write(50,100,50,100,50,100, #111
                  'electron_den_data/water/ergoscf_009_oc6.cube',
                 water_large_ase, soap_water,model_water,cell_spacing)
SOAP generation took: 1.87 seconds
Prediction took: 0.47 seconds
Generated electron_den_data/water/ergoscf_009_oc5.cube
SOAP generation took: 2.06 seconds
Prediction took: 0.47 seconds
Generated electron_den_data/water/ergoscf_009_oc6.cube
In [56]:
#Octant 7
predict_soap_write(0,50,50,100,50,100, #011
                  'electron_den_data/water/ergoscf_009_oc7.cube',
                 water_large_ase, soap_water,model_water,cell_spacing)

#Octant 8
predict_soap_write(50,100,0,50,50,100, #101
                  'electron_den_data/water/ergoscf_009_oc8.cube',
                 water_large_ase, soap_water,model_water,cell_spacing)
SOAP generation took: 1.95 seconds
Prediction took: 0.47 seconds
Generated electron_den_data/water/ergoscf_009_oc7.cube
SOAP generation took: 1.97 seconds
Prediction took: 0.47 seconds
Generated electron_den_data/water/ergoscf_009_oc8.cube

Each of these octants took less than 2 seconds to generate SOAP vectors and 0.47 seconds to predict densities. All these 8 tasks could have ideally been performed independently in a parallel manner.

In [50]:
water_xyz=open('electron_den_data/water/ergoscf_009.xyz').read()
water_cube_oct1=open('electron_den_data/water/ergoscf_009_oc1.cube').read()
water_cube_oct2=open('electron_den_data/water/ergoscf_009_oc2.cube').read()
water_cube_oct3=open('electron_den_data/water/ergoscf_009_oc3.cube').read()
water_cube_oct4=open('electron_den_data/water/ergoscf_009_oc4.cube').read()
xyzview = py3Dmol.view(width=400,height=400)
# xyzview.addVolumetricData(water_cube_oct1,'cube',{'isoval': 0.03,"color":'blue','opacity':0.75,'wireframe':True})
# xyzview.addVolumetricData(water_cube_oct2,'cube',{'isoval': 0.03,"color":'green','opacity':0.75,'wireframe':True})
xyzview.addVolumetricData(water_cube_oct3,'cube',{'isoval': 0.03,"color":'red','opacity':0.75,'wireframe':True})
xyzview.addVolumetricData(water_cube_oct4,'cube',{'isoval': 0.03,"color":'grey','opacity':0.75,'wireframe':True})
xyzview.addModel(water_xyz,'xyz')
xyzview.setStyle({'stick':{}})
xyzview.zoomTo()
xyzview.show()

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

By generating electron densities for such small sections of the system we have essentially solved a major memory issue with this method.

In [57]:
no_elec=0
for i in range(1,9):
    dictCube_cdp = read_cube(f'electron_den_data/water/ergoscf_009_oc{i}.cube')
    cdp_den = dictCube_cdp['data']
    no_elec += np.sum(cdp_den)*diffV
    
print(f'Total number of electron predicted by DeepCDP for 100 water molecules: {no_elec:1.2f}')
Total number of electron predicted by DeepCDP for 100 water molecules: 753.24

Seems like the prediction is about 47 short of the expected number of electrons in the system. Maybe this is due to the tiny discontinuity that arrises at the prediction boundaries.